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Abstract — We propose a two-step algorithm for the construc- 
tion of a Hidden Markov Model (HMM) of assigned size, i.e. 
cardinality of the state space of the underlying Markov chain, 
whose n-dimensional distribution is closest in divergence to a 
given distribution. The algorithm is based on the factorization 
of a pseudo Hankel matrix, defined in terms of the given 
distribution, into the product of a tall and a wide nonnegative 
matrix. The implementation is based on the nonnegative matrix 
factorization (NMF) algorithm. To evaluate the performance of 
our algorithm we produced some numerical simulations in the 
context of HMM order reduction. 

I. Introduction 

Hidden Markov models (HMM) are a simple, yet very 
rich, class of stochastic processes which has become ubiqui- 
tous in several areas of signals, systems, and control. Here we 
restrict attention to HMMs (Y t ),t = 1,2,... taking values 
into a finite set y. The HMMs of this type can always be 
represented as deterministic functions of some Markov chain 
(St),t = 1,2,..., taking values in a finite set S, in the 
sense that Y t = f(St), for all t E N, where the symbol = 
means that the processes (Yt) and (f(St)) have the same 
distributions. The cardinality of S is called the size of the 
representation Y t = f(S t ). 

Since functions of Markov chains generally loose the 
Markov property, HMMs can be used to model also dy- 
namical behaviors exhibiting complex dependencies from the 
past, which cannot be described within the class of Markov 
chains. At the same time HMMs admit a simple parametric 
description, through the transition probabilities matrix of 
(St) and the deterministic function /. Being flexible and easy 
to describe it comes as no surprise that the class of HMMs 
is extensively employed in many applications to real data. 
The applications span the fields of engineering (modeling 
stochastic automata, for automatic speech recognition and 
for communication networks), genetics (sequence analysis), 
biology (to study neuro-transmission through ion-channels), 
mathematical finance (to model rating transitions, or to solve 
asset allocation problems), and many others. 

Although by now many approaches to solve inferential 
problems about HMMs have been proposed in the engineer- 
ing and statistical literature, an algorithm to solve the exact 
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stochastic realization problem from distributional data is still 
missing. Even the weakest form of stochastic realization, 
given the finite dimensional distributions py(Y™) of a HMM 
find the parameters of any of its representations (f(St)), has 
not yet been solved satisfactorily. The early attack dates back 
to [1], and [2], the stochastic realization point of view was 
introduced later in [3] and [4], the most recent results can 
be found in [5] and [6]. 

In the present paper we focus on the simpler approximate 
realization problem which can be roughly formulated as 
follows. Given the distributions qy(Y{ 1 ) of a stationary 
process, find a realization of a HMM of assigned size, which 
best approximates qyiY™) in divergence rate, a most natural 
criterion of closeness between distributions. Unfortunately 
there exists no general, closed form, analytic expression of 
the divergence rate between HMMs [7], [8], let alone of the 
divergence rate between a general stationary process and a 
HMM. To obviate the difficulty we formulate an alternative 
criterion, in terms of the informational divergence between 
nonnegative, pseudo Hankel matrices, representing the finite 
dimensional distributions of the processes. The approximate 
realization problem becomes then amenable to the use of 
Nonnegative Matrix Factorization (NMF) techniques. This 
approach was already investigated in [9], and [10], where we 
proposed a three step, NMF based, optimization procedure to 
construct the parameters of the best approximate realization. 
The same approach, in slightly different contexts, has been 
later followed in [11] and [12], the latter proposing a two 
step algorithm based on estimated data instead of exact 
distributional data, with the second step carried out via linear 
programming. 

The remainder of the paper is organized as follows. 
Section III] contains preliminaries on HMMs. In Section III 
we introduce the pseudo Hankel matrix of the finite di- 
mensional distributions of a stationary process and study its 
factorization properties for the class of HMMs. Section IV 
introduces the two-step approximation algorithm. The last 
Section discusses some of the numerical issues and provides 
examples of order reduction for HMMs. 

II. Preliminaries 

A stochastic process (Yt)t&i with values in y = 
{1,2,... m} is a stationary hidden Markov model (HMM) 
of size N if for some process (X t )teN, with values in 
X = {1, . . . N}, the pair (X t ,Y t ) is jointly stationary and, 
for all y ey, j e X, 



P(Y t+1 = y,X t+l =j\Xr,Yf) = 
P(Y t+1 =y,X t+1 =j\X t ). 



(1) 



From ^ it follows that both, the pair (S t ) := (X t ,Y t ), 
and (X t ) alone, are Markov chains, a property that shows 
why a HMM can always be described as a determinis- 
tic function of a Markov chain, i.e. the projection onto 
the second component of (St). The distribution functions, 
Vy{V\ ■■■Vn) = P(YT = !/?), of (Y t ) are specified by the 
m nonnegative N x N matrices 

m ij(y) = p ( Y t+i = y,X t +i =j\X t = i), 

and by a row vector ix = irA, where A — ^ M(y) is the 
transition probabilities matrix of (X t ). For any (finite) string 
w = Vi ■ ■ ■ y n one gets 

p Y (w)=wM(y 1 )---M(y n )e, 



where e = (1, . . . , 1) T G 



a AT 



Define M{w) = M(yf) 



M(yi)M(y 2 ) ■ ■ ■ M{y n ). It follows that, for any pair of 
(finite length) strings u and v, denoting by uv their con- 
catenation, one has 



Py{uv) = irM(u)M{v)e —: tt(u)j(v) 



(2) 



For any finite length strings u and v and any y 6 y the 
vectors n(u) :~irM{u) (row) and j(v) :—M(v)e (column) 
satisfy the following relations 



Tt{uy) = n(u)M(y), n(n) 



]ir(yu), 



l(yv) = M(y)j{v), 7(v) = ^7(wy) 



(3) 



In common usage the recurrence equations (the ones on the 
left) are called respectively forward equation (for 7r), and 
backward equation (for 7). 

III. Hankel matrices and their factorizations 
We introduce two lexicographic orders on the set y n . The 
first lexical order (flo), increasing from right to left, and 
the last lexical order (llo), increasing from left to right. By 
way of example, if y — {0,1} and n = 2, the strings in 
increasing flo are (00,10,01,11), while, in increasing llo, 
are (00,01,10, 11). 

Definition 1: (Hankel matrices of stationary processes) If 
Qy{') is the finite dimensional distribution of a stationary 
measure Q, 

C := \\QY(UrV s )\\r,s, 

where u r and v s run through all of y n in flo and in llo 
respectively. 

The matrix H® n is generally called the Hankel matrix associ- 
ated to Q, see e.g. [5], but it only has a structural resemblance 
to a genuine Hankel matrix. The Hankel matrices of HMM 
measures admit several nonnegative factorizations stemming 
from d2|. Specifically, if py is the finite dimensional distri- 
bution of a stationary HMM measure P, 

7r(ui) 



H, ; 



TT{Ue) 



7(«i) 



70^) 



where II„, and T n are matrices of sizes m n x N and Nxm" 

respectively {£ = to"). 

The matrices II„, and T n , in turn, can be factored as follows 

II„ = [n n _!M(l); . . . ; n„_!M(m)] , (5) 

r„ = [M(i)r n _i, . . ,,M(m)r n _i] , (6) 

where ";" and "," denote, respectively, vertical and horizontal 
stacking of the blocks. Introducing the matrix 



n„ := [n n _!M(i), . . .,n n _!M(m)] e 

equations (J5l) can be compactly written 



L xmiV 



n„ 



n„-iM, r n = Mr (n _ 1) , 



where 



M := [M( yi ),...,M(y m )}eR^ mN 

r ( „_ 1} := diag(r n _ 1 ...r n _ 1 )eKf Vxm ". 



(7) 



(8) 



Remark 1: Note that both II n _i and r n _i are readily 
obtained from II n and T n by way of the marginal relations 
(the ones on the right) given in (|3j. 

IV. Approximation problem and two-step 
algorithm 

We now pose the problem of best approximation of a 
given stationary process with a HMM of assigned size, con- 
vert it into an approximate nonnegative matrix factorization 
(NMF)JJand propose an algorithm for its numerical solution. 
The following definition will be used throughout the paper. 

Definition 2: (Divergence between positive matrices) Let 

M,N€lR" lxn , 



Z?(M||N) :=]>>//„■ log ^ 



My + Ny). 



(9) 



We are now ready to pose the approximation problem as a 
nonnegative matrix factorization (NMF) problem. 

Problem 1: Given a stationary probability measure Q on 
y°° and N e N, find the parameters {M*(y), y e y} of 
a HMM of size N whose Hankel matrix H r * m = n^r; 
(n > 2iV) is closest to H^ n in divergence. More compactly: 
solve the minimization problem 



n min D(H£jn„r n ), 



(10) 



(4) 



under constraints II n ,r„ > 0, e T n„e = 1, and T n e = e. 

The motivation for this setup comes from the fact that the 
distributions of a HMM of size N are fully determined by 
any of its Hankel matrices H„„ with n > 2N see e.g. [14]. 
The constraints are imposed by the definitions of the factors 
II„ and r„ given above. 

A minimizing nonnegative factorization (II*, T*) always 
exists, see [15], Proposition 2.1, but Problem [T] also calls 
for the construction of the corresponding parameters M*(y). 

'it has become commonplace in the applied literature to reserve the 
acronym NMF to denote the approximate Nonnegative Matrix Factorization 
problem, made popular by [13]. 



The analysis of the ideal case will serve as a guide. If Q were 
a HMM law, the following exact nonnegative factorizations 
would hold by equations (HI and (J7]i 



h q = n Q r 

- m -- m -ir-,ir> *-"-»i ■*■ 



n n i 

MTm), 



(ii) 

(12) 



This can be considered as an ideal algorithm. Feeding into 
the system (111, (12i the input H^„, which is known since 



Q is given, produces the output M, whose blocks contain 
the parameters M(y) sought for. In real situations these 
exact factorizations are generally not valid since Q might 
not be a HMM, or might be a HMM of order larger than N. 
This suggests constructing a two step algorithm where (111 
and ( p~2] > are substituted with NMFs. The scheme below 
illustrates the two steps. 

Step 1: Law Approximation Step 

Given: H«„ 
Problem: min D(H£„||II n r„) 

constraints e T II n e = 1, T n e = e 
Solution: U* n , T*. 

Note that 11^ and T* are of respective sizes (to" x N) 
and (N x to"). Using equation ffify one can build the matrix 
r? n _ 1 - ) needed below. 

Step 2: Parametrization Step (version with T) 
Given: r*, r^_ 1} 
Problem: mm£> (r* ||Mr^_ x) ) 

constraint Me = e 
Solution: M* = [M* ( Vl ) . . . M* (y m )} . 



Note that the constraint Me = e, imposed at Step 2, 
corresponds to the requirement that the transition matrix of 
the underlying Markov chain be stochastic. The resulting 
A* = ^2 M* (y) can be used to compute the parameter 

7T* = TT*A*. 

Step 1 of the algorithm behaves like a typical EM method, 
with convergence of the NMF algorithm to local minima 
of the divergence and dependence on the initial conditions 
11^, r°. Step 2 behaves much better, as one of the factors 
is fixed. The following Lemma summarizes the convergence 
properties of NMF algorithms in the general setup of Step 2 
of the algorithm. Although an easy consequence of the results 
in [16], to the best of our knowledge the Lemma was not 
introduced before, at least not in the literature on NMF 
method 

Lemma 1: Let S and T be given stochastic matrices 
of sizes to x to and N x to respectively. Let Mo be a 
given stochastic matrix of size m x N with strictly positive 
elements. The NMF iterative algorithm [13], [15] applied to 
the problem 

minD{S\\MT) (13) 



with initial condition M , produces a sequence of matrices 
Mfc — > At* (elementwise) where M* is the minimizer of 

D(S\\MT). 

The proof follows directly from Theorem 5 of [16], once it 
is recognized that the problem is decoupled in the rows of 

M. 

The analysis of the equations (R} and d7t, valid in the ideal 
HMM case, shows that as an alternative one could write the 
system 



H Q 

nn 

n„ 



n„ iM. 



This suggests that it would be possible to substitute Step 2 of 
the proposed algorithm with the alternative Step2.alt below. 
Note that from the output TL* n of Step 1, one can construct, 
as noted in RemarkJT] the matrix U* n _i and also repackage 
it, using equation nfj, to form the matrix n^. 

Step 2. alt: Parametrization Step (version with Tl) 



Given: II 
Problem 



n 



n-l 



min D 
M 

constraint Me = e 



(n^njuMj 

int Me = e 
Solution: M* - [M* ( Vl ) . . . M* (y m )} . 



V. Numerical Simulations 

In this section we show the results of numerical simula- 
tions designed to assess the performance of the algorithm 
on HMM order reduction problems. We have also tried to 
evaluate possible differences in performance related to the 
version of Step 2 being used. We selected two examples. In 
both cases the given process is a 4 state HMM with binary 
output. 

Example 1: The given HMM is specified by its transition 
probabilities matrix A and its read-out matrix B. Recall that 
M(y) = AB y where B y is the diagonal matrix with the y-th 
column of B on the diagonal. For this example y <E {0,1}. 
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Example 2: Same read-out matrix B as Example 1. The 
transition probabilties matrix is 



A 2 = 
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A. Discussion of Step 2 of the Algorithm 

As seen at the end of the previous section, Step 2 of the 
algorithm can be implemented in two versions. We will call 
the first the T version and the alternative the II version. By 
Lemma 1, and a variant of it, both NMFs are guaranteed 
to converge to the global optimum, irrespective of the initial 
condition M°, We are interested in evaluating numerically if 
there are significant differences in the speed of convergence 
of the T and II versions of Step 2. For the purpose of 
this comparison Step 1 was fixed and taken as an order 
reduction step, from size 4 to size 2, for both Example 1 and 
Example 2. We tested, on each example, the T and the II 
versions of Step 2, running for each 30 NMFs choosing M° 
randomly. The results were compared in terms of speed of 
convergence of the divergence and variability of the resulting 
M*'s. As an index of variability of the M* parameters we 
computed the sum of the sample variances of the elements 
of M* resulting for each of the 30 runs, i.e. 



R = 



N 2JV 

EE 



i 



T 



T E( M # 



M^f 



where N = 2, is the size of the reduced HMM, T = 30 is 
the number of different initial conditions M°, and M*, = 

^Y^=i M ij- Fi 8- and Fi g- $ relative to Example 1, 
display the variability index R and the averaged (over the 
30 runs with randomly chosen M°) divergence decay against 
the number of NMF iterations. Fig. [3] and Fig. [4] show the 
same for Example 2. 

As expected, both versions of Step 2 are not sensitive 
to the initial conditions and the NMF iterations converge. 
What is particularly pleasing is that the resulting M* is the 
same for both versions of Step 2. The divergence appears to 
converge much faster than the parameter M*, a clear sign 
of the flatness of the criterion near optimality. Indicating 
with MJ- and MJj the parameters of the best approximation 
obtained with the two versions of Step 2, we see that, for 
Example 1 

» _ / 0.26633 0.24781 0.31323 0.17263 
r ~ \ 0.20449 0.44350 0.21345 0.13856 



M 



n 



0.26633 0.24778 0.31324 0.17265 
0.20453 0.44349 0.21344 0.13855 
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Fig. 2. Example 1 - Divergence decay in Step 2 (average of 30 runs), 
r in black, II in blue. 




2000 3000 

Iterations 

Fig. 3. Example 2 - Variability of M* in Step 2 (average of 30 runs), 
r in black, II in blue. 
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Fig. 4. Example 2 - Divergence decay in Step 2 (average of 30 runs), 
r in black, II in blue. 



and for Example 2 



m; 



0.28811 
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Fig. 1. Example 1 - Variability of M* in Step 2 (average of 30 runs), 
r in black, II in blue. 



M 



n 



0.28814 0.15364 0.29571 0.26251 
0.25937 0.25451 0.14342 0.34271 



Results along the same lines were obtained running many 
examples, corresponding to different sizes of order reduc- 
tion, and/or different given HMM processes. The speeds 
of convergence of the divergence for the two versions of 
Step 2 remain always comparable. The variability of M* 
is sometimes slightly different for the two versions. The 
conclusion we drew is that there is not a version that 
systematically outperforms the other. For top performance 
in terms of variability it seems more appropriate to evaluate 
on a case by case basis which version is most suitable. 



B. HMMs order reduction 

We collect here the results of four experiments of HMM 
order reduction. For each of the HMMs of size 4 given as 
Example 1 and Example 2, we ran the algorithm to produce 
HMMs of order reduced to 3 and 2. In all cases Step 2 
was performed in the T version. The set-up of the algorithm 
was fixed as follows: 3,000 NMF iterations for Step 1 in all 
cases, 3,000 and 20,000 NMF iterations for Step 2 for the 
order reductions 4 — » 2 and 4 — > 3 respectively. The number 
of NMF iterations was decided fixing a threshold on the M* 
variability index, R. We randomly chose 30 initial conditions 
(Tl n , T°) for Step 1. As the NMF problem posed in Step 1 
is sensitive to initial conditions (it behaves like a standard 
EM) we obtained 30 different pairs (II*, T*) at its output. 
We therefore had 30 different NMF problems as input to 
Step 2 leading to different M*. In the following tables we 
report the divergence values, before and after each of the 
two steps of the algorithm (DIVlb, DIV1, DIV2b, DIV2) 
and the final divergence (DIV) between the original process 
and the best approximation obtained with the resulting M*. 
Specifically Table [I] and Table [E] refer to the HMM order 
reduction 4 — > 2 for Example 1 and Example 2 respectively. 



TABLE II 
Example 2 - order reduction 4 — > 2 - divergence decays 



Likewise Table III and Table IV report the results of the order 
reduction 4—^3 for the two examples. The high variability 
of the divergence after Step 2 (DIV2), especially in the case 
4 — > 3, is not surprising. It depends on the fact that each run 
corresponds to a different NMF problem for Step 2. 

It is interesting to notice that the final divergence (DIV) 
between the original process and the best approximation have 
the same order of magnitude, irrespective of the random 
initializations for the NMFs of the two steps. In Fig. B] 
relative to Example 1, and Fig. [6] relative to Example 2, 
are plotted the divergences between the original process and 
the output process of the algorithm for the order reduction 
case 4^2. Fig. U\ and Fig. M show the case 4 — > 3. In the 
plots the best numerical approximations are highlighted in 
red. 



RUN 


DIVlb 


DIVlxlO 8 


DIV2b 


DIV2X10 4 


DIVxlO 6 


1 


0.015 


6.601 


0.046 


4.198 


1.020 
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0.014 


6.601 


0.886 
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1.007 
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0.014 


6.601 


0.154 


3.915 


1.018 
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0.014 
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1.007 
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0.013 


6.601 


0.212 


3.468 


1.016 
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0.013 


6.601 


0.452 


1.814 


1.009 
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0.013 


6.601 


0.137 


1.612 


1.008 


9 


0.014 


6.601 


0.076 


1.014 


1.007 


10 


0.014 


6.601 


0.287 


3.579 


1.017 


11 


0.013 


6.601 


0.154 


1.761 


1.009 


12 


0.014 


6.601 


0.219 


2.330 


1.011 


13 


0.014 


6.601 


0.364 


1.666 


1.009 


14 


0.015 


6.606 


0.273 


0.496 


1.007 


15 


0.014 


6.601 


0.215 


1.148 


1.007 



TABLE III 
Example 1 - order reduction 4 — > 3 - divergence decays 



RUN 


DIVlb 


DIVlxlO 9 


DIV2b 


DIV2X10 3 


DIVxlO 7 


1 


0.022 


6.468 


0.126 


28.306 


1.052 


2 


0.022 


6.483 


0.033 


4.616 


1.053 
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6.470 


0.253 


0.022 


1.059 
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11.444 


0.055 


0.493 


1.034 
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0.021 


6.438 


0.419 


0.015 


1.146 
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1.693 


1.049 
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1.050 


8 


0.022 
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0.030 


1.059 
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0.017 


1.012 
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6.452 


0.352 
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1.041 
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TABLE I 
Example 1 - order reduction 4 — > 2 - divergence decays 



TABLE IV 
Example 2 - order reduction 4 — > 3 - divergence decays 
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DIVlb 


DIVlxlO 9 
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RUN 


DIVlb 


DIVlxlO 8 
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DIV2X10 2 
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0.027 


6.470 


0.031 


11.69 


1.051 
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1.051 
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0.009 
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6.136 
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0.008 
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14 


0.009 
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1.006 
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5.667 


1.049 


15 
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